ABSTRACT

We explore the prospects of predicting emission line features present in galaxy spectra
given broad-band photometry alone. There is a general consensus that colours and spectral
features, most notably the 4000 Å break, can predict many properties of galaxies, including
star formation rates, and hence that they could be used to infer some of the line properties.
We argue that these techniques have great prospects in helping us understand line emission
in extragalactic objects and might speed up future galaxy redshift surveys if they are to
target emission line objects only. We use two independent methods, Artificial Neural Networks
(based on the ANNz code) and Locally Weighted Regression (LWR), to retrieve correlations
present in the N-dimensional colour space and to predict the equivalent widths present in
the corresponding spectra. We also investigate how well it is possible to separate galaxies
with and without lines from broad band photometry only. We find, unsurprisingly, that
recombination lines can be well predicted by galaxy colours. However, among collisional
lines some can and some cannot be predicted well from galaxy colours alone, without any
further redshift information. We also use our techniques to estimate how much of the
information contained in spectral diagnostic diagrams can be recovered from broad-band
photometry alone. We find that it is possible to classify AGN and star formation objects
relatively well using colours only. We suggest that this technique could be used to
considerably improve redshift surveys such as the upcoming FMOS survey and the planned
WFMOS survey.






Key words: Galaxy formation: Emission line galaxies - Cosmology: Redshift surveys 



1 INTRODUCTION 

Current galaxy formation models can be very successful in pre- 
dicting the general spectral energy distribution (SED) of galaxies 
at moderate redshifts by evolving stellar population models and 
co-adding a certain number of stellar populations with different
ages (e.g. Le Borgne et al. 2004). These models have become
mainstream in obtaining stellar population properties (like ages) given
current data. With them, constraints can be imposed on the forma- 
tion times of old elliptical galaxies, the epoch of reionization and 
cosmology (e.g. Jimenez & Loeb 2002; Ferreras & Yi 2004). 

However most stellar population models do not include the 
modelling of emission lines in galaxies. This is mainly due to 
the need for modelling the different phases present in the
interstellar medium. Some spectrophotometric codes such as Pegase
(Le Borgne et al. 2004) do include simple prescriptions for emission
lines. Other models such as Starburst99 (Leitherer et al. 1999)
actually provide equivalent width and flux estimates for the most 
important emission lines usually observed in galaxy spectra. 

The broad band shape of the SED can be related empirically to the
presence of an emission line. Roughly speaking, we know that the
emission line strength depends on the condition of the gas inside a
galaxy; this is strongly related to the amount of ongoing star
formation in the galaxy; and the star formation can be inferred to a
certain extent from the colour, i.e. we know that the SED of a
starburst galaxy is much flatter than the corresponding SED of a more
passively evolving red galaxy. Therefore the shape of the SED must be
correlated in some way with the presence of a strong emission feature
in the same galaxy.

We propose to find empirical correlations between the equiv- 
alent widths of different lines and the broad band colours from 
imaging data by utilising novel statistical methods. One method, 
Artificial Neural Networks, is based on the ANNz code (e.g.
Collister & Lahav 2004; Boris et al. 2007; Abdalla et al. 2007),
previously used to predict photometric redshifts from galaxy
colours. The second statistical method we explore here is Locally 
Weighted Regression (LWR). We can then use such correlations to 
understand better the process of galaxy formation and the condi- 
tions under which strong lines are produced. We suggest that with 
Figure 1. The redshift distribution for both of the samples considered in this work. The median redshifts of SDSS and DEEP2 are z ~ 0.1 and 1.0 respectively.



these statistical results, any model attempting to predict emission 
features in galaxy spectra could be considerably improved by be- 
ing calibrated to have the same statistical properties as found in 
nature. 

Another strong motivation for this work is the predictability 
of an emission feature present in high redshift galaxies. The Sloan 
Digital Sky Survey (SDSS) and the 2dF Galaxy Redshift Survey 
(2dFGRS) have probed the nearby Universe. More current (VVDS 
and DEEP2) as well as future efforts will attempt to probe a large 
part of the high redshift Universe. More specifically, if one is only
interested in redshift surveys and not in the more specific SED
of a galaxy, then it is important to minimize the amount of time
spent looking for a galaxy redshift. The most efficient way of doing 
this is to perform redshift surveys targeted at emission line objects. 
If there is a possibility of predicting statistically which galaxies 
have strongest emission lines then the potential time spent locating 
all these galaxies in redshift space can be reduced and the scope 
for larger surveys increased in the same way. These approaches 
would be useful in particular for the forthcoming FMOS survey 
(Dalton et al. 2006) and the planned WFMOS survey.

The outline of this paper is as follows. In Section 2 we present the
data used to derive the correlations; we examine a low redshift
sample (SDSS) as well as a high redshift sample (DEEP2). Section
3 presents the methodology used to extract the correlations from the
data, for which we use LWR and ANNz, and describes the results found.
In Section 4 we present the methodology for the
classification of AGNs and star-forming galaxies. We comment on
the prospects that this technique has in speeding up future redshift 
surveys in Section 5 and conclude in Section 6. We also provide 
two short appendices describing the mathematics of the LWR and 
ANNz methods used. 



2 THE DATA 
2.1 SDSS 

The low redshift data used in this work are a small sample taken
from the SDSS Data Release 2 (Abazajian et al. 2004). Photometry
is available for all galaxies in five optical bands (u, g, r, i
and z). We considered the same flux limited sample described in
Stasinska et al. (2006). This is a random sample of 20000 galaxies
with reddening-corrected Petrosian r-band magnitudes r ≤ 17.77,
and Petrosian r-band half-light surface brightness μ50 ≤ 24.5 mag
arcsec^-2 (Strauss et al. 2002). As a quality cut, we selected only
the objects that show a signal-to-noise (S/N) ratio greater than 5
in the g, r and i bands. We plot the redshift distribution of this
sample in Fig. 1.

The SDSS spectra used here cover the wavelength range
3800-9200 Å, and have a spectral resolution of ~1800. They were
taken with the standard 3 arcsec diameter fibres in the SDSS
spectrograph. The spectra are first corrected for Galactic extinction
using the maps of Schlegel et al. (1998) and the extinction law
of Cardelli et al. (1989). They are then brought to the rest frame
and resampled from 3400 to 8900 Å in steps of 1 Å with a flux
normalization by the median flux in the 4010-4060 Å region. These
procedures are necessary for the spectral analysis described in
Section 2.3.

2.2 DEEP2 

The data sample used at high redshift was the DEEP2 first data
release (Davis et al. 2003). The DEEP project is an ongoing project,
producing spectroscopy for targets in the redshift range 0.75 <
z ≲ 1.5. The survey uses the DEIMOS spectrograph (Faber et al.
2003) on the Keck II telescope and aims to target around 40000
galaxies over 3 square degrees. The targets are pre-selected from
imaging with B, R and I filters taken with the CFH12k camera on
the Canada-France-Hawaii telescope (e.g. Coil et al. 2004). Galaxies
are imaged in the R and I bands and selected to R_AB < 24.1.
Furthermore, a colour cut is applied to pre-select high-redshift
objects at z > 0.7, and only ~3 per cent of galaxies
above this redshift are rejected by the colour cut. The spectra are
taken at moderately high resolution (R ~ 5000) and span the range
6300 < λ < 9100 Å. Hence the [O II] λ3727 doublet is found from
0.7 < z < 1.4. This sample contains 4681 objects and its redshift
distribution is plotted in Fig. 1.

2.3 Emission line measurements 

In order to measure the emission lines from the SDSS and DEEP2
galaxy spectra we have used a code to fit them as Gaussian
functions, composed of three parameters: width, offset (with respect
to the rest-frame central wavelength) and flux. In the case of the
SDSS sample, we measured emission lines using the methods
described in detail by Cid Fernandes et al. (2005) and Mateus et al.
(2006). The following emission lines have been measured: [O II],
Hβ, [O III], [O I], Hα, and [N II] (see footnote 1). Lines from the
same ion are assumed to have the same width and offset, and we
consider the following flux ratio constraints:
[O III]λ5007/[O III]λ4959 = 2.97 and [N II]λ6584/[N II]λ6548 = 3.
To measure the intensities of emission lines we have to remove the
stellar contribution to the continuum in their wavelength range,
mainly to account for absorption features in the emission line
regions. This is done by computing for each SDSS galaxy a synthetic
stellar spectrum obtained from a linear combination of simple
stellar population spectra that fits the observed continuum in the
whole spectral range (after removal of the zones of emission lines
and bad pixels). Removing this synthetic spectrum from the observed
one leaves us with a pure emission line residual spectrum from which
we can easily measure the emission lines. This method provides a
reliable estimate of the stellar absorption in the entire spectrum,
including the windows where emission lines are found. This procedure
is very important since the regions of some emission lines (mainly
the Hα and Hβ Balmer lines) can be contaminated by strong absorption
features which may reduce the equivalent width of the lines. For
lines which have no absorption this should make no difference. We
have also compared these results with those obtained by integrating
the line flux directly over the spectrum and comparing it to the
continuum; the results do not change much, showing that the method
is robust.

Figure 2. Equivalent widths of the emission lines [O II], Hβ, [O III], [O I], Hα, and [N II], measured from the SDSS spectra as a function of the colours (u-r),
(g-r), (u-i), and (g-i). The top-right numbers are the total number of galaxies in each panel and the Spearman rank correlation coefficient. The grey scale level
represents the number of galaxies in each pixel, darker pixels having more galaxies.

For the DEEP2 data, only the [O II] line provided suitable
measurements. Moreover, we have adopted a distinct approach to
measure this emission line. Instead of fitting the continuum with
a synthetic spectrum we have performed a polynomial fit of two
continuum windows (3653-3713 and 3741-3801 Å) around the
line. The emission line was then measured from the
continuum-subtracted spectrum through the Gaussian fitting procedure.
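
To make the continuum-window idea concrete, the following is a minimal sketch and not the code used in this work. It assumes a straight line (a first-order polynomial) as the continuum model and replaces the Gaussian fit by a direct trapezoidal integration of the continuum-normalised line. The function names and the synthetic spectrum are hypothetical; only the two continuum windows are taken from the text.

```python
import math

# Continuum windows quoted in the text (rest-frame Angstrom).
OII_WINDOWS = [(3653.0, 3713.0), (3741.0, 3801.0)]


def fit_linear_continuum(wave, flux, windows):
    """Least-squares straight line through the pixels inside the windows."""
    pts = [(w, f) for w, f in zip(wave, flux)
           if any(lo <= w <= hi for lo, hi in windows)]
    n = len(pts)
    mean_w = sum(w for w, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    sxx = sum((w - mean_w) ** 2 for w, _ in pts)
    sxy = sum((w - mean_w) * (f - mean_f) for w, f in pts)
    slope = sxy / sxx
    return slope, mean_f - slope * mean_w


def equivalent_width(wave, flux, windows, line_range):
    """Emission EW (positive for emission): integral of (F - C) / C."""
    slope, intercept = fit_linear_continuum(wave, flux, windows)
    lo, hi = line_range
    sel = [(w, f) for w, f in zip(wave, flux) if lo <= w <= hi]
    ew = 0.0
    for (w0, f0), (w1, f1) in zip(sel, sel[1:]):
        c0 = slope * w0 + intercept
        c1 = slope * w1 + intercept
        ew += 0.5 * ((f0 - c0) / c0 + (f1 - c1) / c1) * (w1 - w0)
    return ew


def synthetic_spectrum(amplitude, sigma, centre=3727.0):
    """Hypothetical test spectrum: flat continuum of 1 plus one Gaussian line."""
    wave = [3650.0 + 0.5 * i for i in range(311)]
    flux = [1.0 + amplitude * math.exp(-0.5 * ((w - centre) / sigma) ** 2)
            for w in wave]
    return wave, flux
```

For a Gaussian line of amplitude A and width sigma on a unit continuum, the result should approach A sigma sqrt(2 pi), which gives a quick sanity check of the integration.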



3 METHODOLOGY: PREDICTING EMISSION 
FEATURES WITH BROAD BAND PHOTOMETRY 

It is well known that galaxy colours present a good correlation with
the emission line properties of galaxies, such as their equivalent
widths (EW). We illustrate this in Fig. 2, where the equivalent
widths of emission lines measured from the SDSS spectra are plotted
as a function of various galaxy colours. Several correlations can
be identified between these quantities, as confirmed by the high
values of the Spearman rank correlation coefficients obtained for some
of them, particularly those involving the Hα and Hβ emission lines.
Therefore, it is not only the 4000 Å break colour index that encodes
information about the emission features in galaxies (e.g. Mateus et
al. 2006), as typical galaxy colours can also be used to infer these
properties in a convenient manner, from photometric data only. On
this basis, here we use different methods to combine all the
information available in the galaxy colours to produce an empirical
relation between colours and emission strength.
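
For reference, the Spearman rank correlation coefficient quoted in Fig. 2 is the Pearson correlation of the ranks of the two quantities. A minimal sketch is given below, assuming tied values receive their average rank; the function names are illustrative and this is not the code used to produce the figure.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = 0.5 * (i + j) + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter, any monotonic relation between colour and equivalent width gives a coefficient of +1 or -1 regardless of its functional form, which is why it is a convenient first diagnostic before fitting a non-linear model.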

We have analyzed the data with two different techniques to 
reinforce the robustness of the correlations found and to compare 
the two techniques used. We have used Artificial Neural Networks 

1 In the entire paper [O II] stands for [O II] λ3727, [O III] for [O III] λ5007,
[O I] for [O I] λ6300, and [N II] for [N II] λ6584.
Figure 3. The scatter plot for the 1100 galaxies that were used as a testing
set for the prediction of the equivalent width of the [O II] line in the DEEP
sample. The artificial neural network method is used in this plot. We can see
a correlation between the log of the predicted equivalent width and the actual
measurement. The scatter in the log is 0.27. Similar results with virtually
the same scatter are found with the LWR technique.



(Bishop 1995), as implemented in the publicly available ANNz
code (Collister & Lahav 2004 and references therein), and a
Locally Weighted Regression (LWR) (Atkeson et al. 1997). See the
Appendices for the mathematical details and implementation of
these methods.

Both of the techniques we present here rely on a training set
which is representative of the true population of galaxies in order
to retrieve the information on the line features in each galaxy. We
have separated each sample into two groups: one which we have used
to train the machine learning algorithms, and a second, which we
call the testing set, which is set aside and then used to test the
reliability of the methods. We have set aside 1100 galaxies in DEEP
and 14000 galaxies in the SDSS as testing sets.

The rest of the sample was used to train the algorithms,
that is 2000 galaxies in the DEEP sample and 6000 galaxies
in the SDSS sample. For both methods used here we
have to subdivide this sample into a training sample and a
validation sample. In the case of neural networks this is done to prevent
over-fitting. In the case of the LWR method, this is done to obtain
the kernel value K which is most appropriate to the data. Both
prescriptions are explained in the Appendix. The architecture of the
neural network used is described in the Appendix and is of the type
N:2N:2N:1, where N is the number of magnitudes available, which
is three in the case of the DEEP data and five in the case of SDSS.
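
The bookkeeping described above can be sketched as follows. This is an illustrative sketch only: the parameter count assumes fully connected layers with one bias per node, and the size of the validation subset is a hypothetical free parameter, since the text does not state how the training galaxies were subdivided here.

```python
import random


def network_layers(n_magnitudes):
    """Layer sizes for the N:2N:2N:1 architecture quoted in the text."""
    n = n_magnitudes
    return [n, 2 * n, 2 * n, 1]


def count_free_parameters(layers):
    """Weights plus one bias per node (assumed) in a fully connected network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


def split_sample(n_total, n_test, n_validation, seed=0):
    """Random, disjoint training / validation / testing index lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    validation = indices[n_test:n_test + n_validation]
    training = indices[n_test + n_validation:]
    return training, validation, test
```

Under these assumptions the SDSS network (N = 5) has 181 free parameters and the DEEP network (N = 3) has 73, both far fewer than the number of training galaxies available.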

With the DEEP sample the only line for which we have produced
a fit was the [O II] line. We have performed this analysis only
on this line because the DEEP survey relies heavily on the [O II]
line for redshift estimation, so that this line is measured with a
high completeness. Other lines do not appear in a large fraction of
the galaxies in the entire sample. In the sample from SDSS we
performed this analysis for all measured lines: [O II], Hβ, [O III],
[O I], Hα, and [N II].

We plot in Fig. 3 our attempts to recover the equivalent width
of the [O II] line in the DEEP survey using broad band photometry
only. We have plotted a scatter plot of the real equivalent width
versus the predicted equivalent width for the testing set given the
training set. As we can see the logarithm of the equivalent width is
Figure 4. The scatter plot of the log of the real equivalent width versus the predicted equivalent width of galaxies for different lines. For all the plots we have
used only galaxies in the testing set, which took no part in the training of the machine learning algorithm that made the prediction. We present here
results using both methods described in this paper. As we can see the results found with both methods are comparable and we found no evidence that one method
performed better than the other given the data volume available. We can also see that some lines present a better correlation than others and that generally
recombination lines are better predicted than collisional lines.



predicted reasonably well, which shows that this weak correlation can
be recovered well with the techniques we have proposed here.
We can predict the log of the equivalent width with an error of
0.27 without any spectroscopic information on the redshift of each
galaxy.
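
As an illustration of what an error of 0.27 in the log means, the sketch below computes the scatter of the log residuals for a testing set. It assumes that the scatter is the root mean square of log10(predicted) - log10(measured), which the text does not state explicitly; the function name is hypothetical.

```python
import math


def log_scatter(ew_measured, ew_predicted):
    """Rms of log10(predicted) - log10(measured) over the testing set."""
    residuals = [math.log10(p) - math.log10(m)
                 for m, p in zip(ew_measured, ew_predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def dex_to_factor(scatter_dex):
    """Multiplicative factor corresponding to a scatter quoted in dex."""
    return 10.0 ** scatter_dex
```

Under this definition a scatter of 0.27 dex corresponds to predicted equivalent widths that are typically within a factor of about 1.9 of the measured ones.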

We present in Fig. 4 the results we find by fitting the equivalent
widths of different lines for SDSS galaxies. Again in these plots we
have used only galaxies in the testing set, which have not been used
in the training process. We find that some line equivalent widths are
relatively well predicted by galaxy magnitudes and colours. This is
the case for the hydrogen lines, where a reasonably strong correlation
was found. We have found however that the collisional lines observed
are harder to predict. For instance the [O I] line equivalent width
had virtually no correlation with any combination of the colours
and magnitudes.

We compare results for both of the techniques using the same
testing sets in Fig. 4. We plot them side by side so that we can see
the relative comparison between the neural network non-linear fit
to the data and the locally linear regression used. We have inspected
the data and found that there was no significant difference between
the two methods. Both predictions have yielded results with a very
similar scatter given the data volume available. We conclude that
for fitting purposes neural networks and Locally Weighted
Regression perform well and on an equal footing.

We have also attempted to simply classify objects into
line-emitting and non line-emitting. We have made this separation for
each galaxy upon a simple criterion: whether an emission line is
detected above a certain value. This separation has been done for
each emission line analyzed. For instance if a certain galaxy has an
[O I] line but does not possess an Hα line then it is considered as an
emitting object for the purpose of the analysis of the [O I] line but
it is considered as a non-emitting object for the purposes of the Hα
line analysis.

We have then used the training set in the following way: we
have assigned the value of one to emitting objects, defined as those
in which the line is detected by our code with a signal-to-noise
ratio above three, and the value of zero to 'non-emitting' objects.
A neural network was then trained on this specific training set.
When a neural network is trained on binary targets of zero and one,
it returns a probability (e.g. Lahav et al. 1996 and references
therein) of that given object belonging to the class described by
the binary number. So in our case the value returned by the neural
network will be the probability of that object having the emission
feature tested for. We have also
attempted to perform the same analysis with the LWR technique
Figure 5. Histograms of the log of the equivalent width for SDSS galaxies.
Each plot represents the results for one line. The black histograms are for
all the objects for which a line was detected in the analysis of the SDSS
data. We have then chosen a median value for the log of the equivalent
width for these lines and separated the black samples into a sample with
predicted equivalent width higher than this median value and a sample with
predicted value lower than this median value. This shows the amount of
overlap that there is in predicting line features with the methods proposed.
We can clearly see the bi-modality of galaxies, and by looking at the Hβ results
we can also see that our methods go further than simply detecting this
bi-modality, clearly separating objects with high and low equivalent width.



but found that, although the LWR technique produces predictions of
the equivalent widths which are as good as those of the neural
network fit, it was not able to perform this classification task
nearly as well as the neural networks. This was to be expected as
the LWR method is not directly applicable to classification tasks:
the linear fit results are not representative of a probability in
the same way as the neural network outputs are.
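
Once a probability is available for each galaxy, a selection can be made by thresholding it. The sketch below is illustrative and not part of the analysis in this work: it assumes a hypothetical threshold on the network output and computes the completeness (fraction of true emitters selected) and purity (fraction of selected objects that are true emitters) of the resulting sample.

```python
def completeness_and_purity(probabilities, is_emitter, threshold=0.5):
    """Completeness and purity of the sample with probability >= threshold."""
    selected = [p >= threshold for p in probabilities]
    true_positives = sum(1 for s, e in zip(selected, is_emitter) if s and e)
    n_selected = sum(1 for s in selected if s)
    n_emitters = sum(1 for e in is_emitter if e)
    completeness = true_positives / n_emitters if n_emitters else float("nan")
    purity = true_positives / n_selected if n_selected else float("nan")
    return completeness, purity
```

Raising the threshold generally trades completeness for purity, which is the relevant trade-off when the goal is to target only galaxies likely to yield an emission-line redshift.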

We plot these results in Fig. 6. We can see that the results are
varied and depend on the emission line being assessed. For instance
there is a good prospect of training a neural network to identify
hydrogen emitters and [N II] emitters at low redshift, but
on the other hand the [O I] line exhibits a very poor correlation
between colours and magnitudes and its EW. The other oxygen lines
exhibit some correlations that are picked up by our methods.

In order to have an idea of what contamination would be
introduced by performing a selection based on the predicted equivalent
width of a galaxy with our method, we have plotted histograms for
the entire sample and then chosen the center of the distribution of
the equivalent widths in log space. We have then plotted the
histograms for the samples with high predicted equivalent width and
low predicted equivalent width in Fig. 5.

While predicting the equivalent widths for these emission 




Figure 6. Histogram of the probability that a galaxy is a line-emitting
galaxy or a non line-emitting galaxy given by the neural networks. The
black histogram represents the entire population. The red and blue
histograms represent the line emitting objects and the non line emitting objects.
We can see that we can clearly separate the blue and red populations for some of
the cases whereas other lines cannot be diagnosed with this method.



lines we have found the already known result of galaxy bi-modality.
We know that galaxies are bimodal and that late type galaxies have
higher star formation rates and therefore stronger emission features
whereas the opposite is true for early type galaxies. We can see this
bi-modality in the histogram of the equivalent widths for the
different lines in Fig. 5. This is not found for every line because
there are selection effects (fainter lines are simply not seen in some
galaxies). For instance the ratio of the Hα and Hβ lines for a single
galaxy should be roughly constant, just below three. This is because
both lines are recombination lines and the temperature dependence
of the emissivity is roughly the same for both lines. However we do
not find this bimodality for Hβ but we find it for Hα. This is simply
because the Hβ flux is smaller, so that many of the
fainter lines remain undetected. Furthermore we argue that the
correlations found in Hβ are strong, which suggests that the method we
are using is going beyond simply separating the bi-modality and is
ordering the lines according to their equivalent widths.

4 METHODOLOGY: AGN/STAR-FORMATION 
CLASSIFICATION 

4.1 Traditional Spectral galaxy classification 

We adopted a traditional procedure to classify galaxies accord- 
ing to their emission line properties. By examining diagnostic dia- 
grams formed by line ratios of optical emission lines, such as the 
Figure 7. Classification of objects as AGN, SF or passive objects via neural
networks. The points are colour coded according to the spectral classification,
with blue, green, orange and red denoting passive galaxies, AGN, hybrid objects and
star forming galaxies respectively. Each class has been assigned a
doublet in the following way: passive galaxies (0,0), AGN (1,0), SF galaxies
(0,1) and hybrid objects (0.5,0.5). The neural network has been used to
estimate this doublet and each galaxy is plotted at (e1, e2) = (e1 +
e2/2) i + e2 j. As we can see it is possible to predict AGN features from
broad band photometry well. There is a degeneracy between the colours of
many passive galaxies and star forming ones but it is in general possible to
classify them with colours only. As expected hybrid objects lie on the line
between AGN and SF galaxies.



[O III]/Hβ versus [N II]/Hα diagram proposed by Baldwin et al.
(1981), we can distinguish emission-line galaxies according to the
mechanism responsible for producing the lines. In such diagrams,
hosts of active nuclei (AGN) and star-forming galaxies form two
very distinct branches, or wings, which makes it easier to separate
them. We use this 'convenient' diagram to classify galaxies
following the classification scheme discussed in Stasinska et al. (2006),
based on a theoretical curve used to distinguish galaxies with pure
star-formation from those objects with some contribution from
nuclear activity to the line intensities. Galaxies below such a curve
are classified as normal star-forming galaxies (SF). In addition, we
have also used the empirical curve proposed by Kauffmann et al.
(2003) to identify the AGN hosts. Galaxies between the two curves
are hybrid objects (the contribution of the AGN to the Hβ emission
of galaxies below the Kauffmann et al. line is at most 3 per cent)
and therefore will not be included in our analysis. The AGN hosts
are then selected as those objects located above the Kauffmann et
al. empirical curve.
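
The two-curve scheme can be sketched in a few lines. The sketch below is illustrative only. The Kauffmann et al. (2003) demarcation is written in its commonly quoted form, log([O III]/Hβ) = 0.61 / (log([N II]/Hα) - 0.05) + 1.3, which is not reproduced in the text above and should be checked against the original reference. The pure star-formation curve is left as a user-supplied function because its coefficients are not given here; the function names are hypothetical.

```python
def kauffmann03(x):
    """Empirical AGN curve in its commonly quoted form; x = log10([N II]/Ha)."""
    return 0.61 / (x - 0.05) + 1.3


def classify_bpt(x, y, pure_sf_curve, agn_curve=kauffmann03, asymptote=0.05):
    """Classify one galaxy from x = log10([N II]/Ha), y = log10([O III]/Hb).

    pure_sf_curve: callable giving the theoretical pure star-formation limit
    (to be supplied by the user; not specified in this sketch).
    """
    # To the right of the vertical asymptote every object is above the AGN curve.
    if x >= asymptote or y > agn_curve(x):
        return "AGN"
    if y < pure_sf_curve(x):
        return "SF"
    return "hybrid"
```

A call such as `classify_bpt(-0.5, 0.5, my_sf_curve)` returns one of the three labels, with `my_sf_curve` standing for whichever pure star-formation limit is adopted.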



4.2 Automated Spectral galaxy classification 

In this subsection we attempt to classify the spectral features
instead of simply predicting the equivalent width of lines. We have
described in Section 4.1 how we separate galaxies with emission
features into AGN or star forming (SF) galaxies.

We have separated the 20000 SDSS galaxies we used in this
paper into the following classes: passive galaxies, AGN, star
forming galaxies and hybrid objects. Some objects were not classified
by the code analysing the spectroscopic data and we have removed
those galaxies. We have associated a doublet (e1, e2) to each one of
these categories in the following way: passive galaxies (0,0), AGN
(1,0), SF galaxies (0,1) and hybrid objects (0.5,0.5).

We have then used a neural network with 5 input nodes, 2
hidden layers with 10 nodes in each layer and 2 output nodes to
classify galaxies into AGN, passive galaxies and SF galaxies according
to the doublets assigned above. We have used only photometry to
perform the training. We used 5000 galaxies for training and 3000
for validating the network. The remaining galaxies were plotted
as a testing set in Fig. 7. The network returned a doublet for each
galaxy. For visual reasons, we plot the doublets in the following
basis (e1, e2) = (e1 + e2/2) i + e2 j, so that the labels AGN, SF
galaxies and passive galaxies are on the corners of an equilateral
triangle.
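
The change of basis can be checked numerically with the short sketch below. It is illustrative only, and the `y_scale` argument is a hypothetical addition: with the mapping exactly as printed (`y_scale = 1`) the three class corners form an isosceles triangle with sides 1, 1.118 and 1.118, whereas scaling the j component by sqrt(3)/2 would make it exactly equilateral. In both cases the hybrid doublet (0.5, 0.5) lands at the midpoint of the AGN-SF side.

```python
import math

# Doublets assigned to each spectral class in the text.
CLASS_DOUBLETS = {
    "passive": (0.0, 0.0),
    "AGN": (1.0, 0.0),
    "SF": (0.0, 1.0),
    "hybrid": (0.5, 0.5),
}


def to_plot_coords(e1, e2, y_scale=1.0):
    """Map a doublet to plot coordinates (e1 + e2/2, y_scale * e2)."""
    return (e1 + 0.5 * e2, y_scale * e2)


def triangle_sides(y_scale=1.0):
    """Side lengths passive-AGN, AGN-SF and SF-passive in plot coordinates."""
    p = {name: to_plot_coords(*CLASS_DOUBLETS[name], y_scale=y_scale)
         for name in ("passive", "AGN", "SF")}
    return (math.dist(p["passive"], p["AGN"]),
            math.dist(p["AGN"], p["SF"]),
            math.dist(p["SF"], p["passive"]))
```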

As we can see from Fig. 7 there is a trend in the distribution
of points in this (e1, e2) diagram. The distribution does not imply a
transition from passive galaxies to SF galaxies to AGN. It
indicates that the colour data are able to make a good distinction between
passive galaxies and AGN; there is a relatively good distinction
between AGN and SF galaxies and a certain overlap between passive
and SF galaxies, which we had already found when attempting to classify
galaxies in the previous section. This offers a way of
classifying objects photometrically as AGN or star forming galaxies via
a training set and broadband photometry only.



5 PROSPECTS FOR FUTURE SURVEYS

We have shown that it is possible to predict emission features
using broad band photometry. We find that the correlations vary in
strength from line to line; in particular, recombination lines are more
predictable than collisional lines. We have not included any
information on the shape or size of the galaxies. We expect that the
results found here would be stronger if information such as shapelet
coefficients for galaxies were included in the analysis, as we know
that galaxies with different star formation rates, and therefore
different line strengths, have different morphologies.

We argue that this technique can be used to speed up future
spectroscopic redshift surveys. Future facilities such as
FMOS or WFMOS will produce spectroscopic surveys in the
optical and the IR part of the spectrum targeting mainly the [O II] and
the Hα lines. We have shown here that there is a reasonable
predictability of the [O II] line at high redshift and that the Hα and Hβ
lines are well predicted in the low redshift Universe. A complete
high redshift sample of around 10000 objects could be used to train a
network to then select the preferred targets.

If one chooses to predict the line strengths, one alternative
would be to fit stellar population synthesis models to the observed
spectrum and from that infer a star formation rate and therefore
a line flux. We argue that the technique we have proposed here is
complementary: with it one could easily encode information on the
morphology of the galaxy, whereas if one fits stellar population
models to the training data it would be hard to include such
information in the analysis.



6 CONCLUDING REMARKS 

We have looked, in this paper, for empirical relations between the 
equivalent width for several different lines and the broad band 
colours for the same objects. We used an automated way to explore 
the correlations found in the data via the use of LWR and ANN.
The two methods give very similar results. 

We have performed the analysis on two samples, a low redshift
sample obtained from the Sloan Digital Sky Survey covering
the nearby Universe and a high redshift sample from the first data
release of the DEEP survey covering a deeper sample of the
Universe. We found that in the DEEP data one could predict the
equivalent width of the [O II] line relatively well by looking at broad band
colours only. With the SDSS data six lines were predicted from a
training set sample. In general collisional lines presented little
correlation, although some of them were reasonably predicted by the
colours. There was a stronger correlation found in recombination
lines.

We have compared the power of prediction of both methods
used. We have concluded that both Artificial Neural Networks and
the Locally Weighted Regression methods are capable of recovering
most of the information encoded in the training set, with little
statistical difference between the two methods. We have used both
methods for classification purposes and found that the Neural
Networks are capable of classifying the objects into line-emitting and
non line-emitting but that the Locally Weighted Regression method
was unable to do so. We have shown that it is possible to classify
galaxies into AGN, passive galaxies and line emitting galaxies well
without galaxy spectra, using solely colours and a training set of
the order of 15000 galaxies.

We have discussed the prospects of speeding up redshift surveys
with these techniques and we conclude that with a reasonable training set
of the order of 10000 galaxies one would be able to considerably
speed up future surveys done with instruments such as FMOS and
WFMOS. Furthermore, in this paper we have only taken advantage
of the colour data; however the methods described can very easily
accommodate other information such as morphology or size.

APPENDIX A: LOCALLY WEIGHTED REGRESSION 

Locally Weighted Regression (LWR) is a machine learning method
capable of mapping complex, non-linear relations between variables
(Atkeson et al. 1997).

As in a global fit, LWR minimises a quantity related to the
chi-square. The difference is that each data point of the training
set has a corresponding weight, which depends on a given query
point. Thus, the fitting parameters are valid only locally, for that
particular query point. In this method the data available for the
machine learning are separated into a training set and a validation
set.

In this method, the quantity to be minimised for each query
point is given by

$E = \sum_i w_i \left[ y_i - f(x_i) \right]^2$ (A1)

where $f$ is the linear function to be fitted and the weight $w_i$
of each point depends on the distance between the query point and
the data point of the training set. The sum over $i$ runs over all
points of the training set. Here the weights have been chosen to
follow

$w_i = \exp\left[ -D^2(x_i, x_q) / 2K^2 \right]$ (A2)

where the function $D$ is the Euclidean distance between the query
point $x_q$ and the training data point $x_i$. The parameter $K$ is
called the kernel width, and sets the width of the Gaussian
determining the weights. For reasonable values of $K$ only the data
near the query point contribute significantly to the local fit. The
optimal value of $K$ for a given training set is found by minimising
the error on a validation set: in the training stage many kernel
widths are tried, and the one which produces the best results for
the validation set is chosen and fixed. Once $K$ is chosen with the
aid of the training and validation sets, it remains to define how
$f$ is chosen.
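
As a minimal sketch of the weighting scheme in equation (A2), assuming
each galaxy's colours are stored as a plain tuple, the weights for one
query point could be computed as follows. The function name
`gaussian_weights` and the example points are illustrative only and do
not come from the analysis in this paper.

```python
import math

def gaussian_weights(train_x, query, kernel_width):
    """Weights w_i = exp(-D^2(x_i, x_q) / (2 K^2)) for one query point."""
    weights = []
    for x in train_x:
        # Squared Euclidean distance between training point and query point.
        d2 = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-d2 / (2.0 * kernel_width ** 2)))
    return weights

# Hypothetical two-colour training points and a query at the origin.
train_x = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
w = gaussian_weights(train_x, query=(0.0, 0.0), kernel_width=1.0)
```

With a kernel width of 1, the point at distance 5 from the query
receives a weight of exp(-12.5), so it is effectively excluded from the
local fit, while the coincident point receives a weight of exactly 1.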

We choose to expand the fitting function $f$ as

$f(x) = \beta_1 t_1(x) + \beta_2 t_2(x) + \ldots + \beta_M t_M(x)$ (A3)

where the functions $t_i$ are polynomial terms formed from the
inputs $x_i$; i.e. $t_1 = 1$, $t_2 = x_1$, $t_3 = x_1^2$, and so
forth. This equation can be written as

$f(x) = \beta^T t(x)$ (A4)

where $t(x)$ is the vector of polynomial terms. Here the weights
are recomputed according to the distance between the points $x_k$
and the query point, and the vector $\beta$ is computed according to

$\beta = (X^T X)^{-1} X^T y$ (A5)

where

$(X^T X)_{ij} = \sum_{k=1}^{N} w_k \, t_i(x_k) \, t_j(x_k)$ (A6)

and

$(X^T y)_i = \sum_{k=1}^{N} w_k \, t_i(x_k) \, y_k$ (A7)

hence we can solve for the best solution by using a Cholesky
decomposition (Press et al. 1992). Now that the $\beta$ vector and
the kernel are chosen, we simply apply the relation using the testing
set as the query points.
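
The fitting procedure of equations (A3) to (A7) can be sketched as
follows. This is an illustrative pure-Python version, not the code used
to obtain the results in this paper: it assumes a basis containing only
the constant and linear terms, and the names `basis`, `cholesky_solve`,
`lwr_predict` and `select_kernel_width` are hypothetical.

```python
import math

def basis(x):
    """Polynomial terms t(x); this sketch keeps only constant and linear terms."""
    return [1.0] + list(x)

def cholesky_solve(a, b):
    """Solve a @ beta = b for a symmetric positive-definite matrix a."""
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    # Forward substitution: low @ z = b.
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    # Back substitution: low^T @ beta = z.
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(low[k][i] * beta[k] for k in range(i + 1, n))
        beta[i] = (z[i] - tail) / low[i][i]
    return beta

def lwr_predict(train_x, train_y, query, kernel_width):
    """Locally weighted linear fit evaluated at one query point."""
    terms = [basis(x) for x in train_x]
    m = len(terms[0])
    # Gaussian weights of equation (A2), recomputed for this query point.
    weights = [
        math.exp(-sum((p - q) ** 2 for p, q in zip(x, query))
                 / (2.0 * kernel_width ** 2))
        for x in train_x
    ]
    # Weighted normal equations, equations (A6) and (A7).
    xtx = [[sum(w * t[i] * t[j] for w, t in zip(weights, terms))
            for j in range(m)] for i in range(m)]
    xty = [sum(w * t[i] * y for w, t, y in zip(weights, terms, train_y))
           for i in range(m)]
    beta = cholesky_solve(xtx, xty)
    return sum(b * t for b, t in zip(beta, basis(query)))

def select_kernel_width(train_x, train_y, valid_x, valid_y, candidates):
    """Pick the kernel width with the smallest squared error on the validation set."""
    def validation_error(k):
        return sum((lwr_predict(train_x, train_y, x, k) - y) ** 2
                   for x, y in zip(valid_x, valid_y))
    return min(candidates, key=validation_error)

# Toy data lying exactly on y = 2 + 3x, so the local linear fit recovers it.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_y = [2.0, 5.0, 8.0, 11.0]
prediction = lwr_predict(train_x, train_y, query=(1.5,), kernel_width=1.0)
```

Because $\beta$ depends on the weights, the normal equations are rebuilt
and solved for every query point. For curved relations, a small kernel
width keeps the fit local, which is the behaviour the validation-set
search in `select_kernel_width` is meant to capture.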



APPENDIX B: ARTIFICIAL NEURAL NETWORKS 

We use a particular species of ANN known formally as a multi-layer
perceptron (MLP). An MLP consists of a number of layers of nodes
(Fig. B1; see e.g. Bishop 1995, and references therein, for
background). The first layer contains the inputs, which in this paper
are the magnitudes, $m_i$, of a galaxy in a number of filters (for
ease of notation we arrange these in a vector
$\mathbf{m} = (m_1, m_2, \ldots, m_n)$). The final layer contains the
outputs; here the equivalent width or emission probability.
Intervening layers are described as hidden, and there is complete
freedom over the number and size of hidden layers used. The nodes in
a given layer are connected to all the nodes in adjacent layers. A
particular network architecture may be denoted by
$N_{\rm in}$:$N_1$:$N_2$: ... :$N_{\rm out}$, where $N_{\rm in}$ is
the number of input nodes, $N_1$ is the number of nodes in the first
hidden layer, and so on. For example, 9:6:1 takes 9 inputs, has 6
nodes in a single hidden layer and gives a single output.

Each connection carries a weight, $w_{ij}$; these comprise the
vector of coefficients, $\mathbf{w}$, which are to be optimised. An
activation function, $g_j(u_j)$, is defined at each node, taking as
its argument

$u_j = \sum_i w_{ij} \, g_i(u_i)$, (B1)



Figure B1. A schematic diagram of a multi-layer perceptron, as
implemented by ANNz, with input nodes taking, for example, magnitudes
$m_i = -2.5 \log_{10} f_i$ in various filters, a single hidden layer,
and a single output node. The architecture is n:p:1 in the notation
used. Each connecting line carries a weight $w_{ij}$. The bias node
allows for an additive constant in the network function defined at
each node. More complex networks can have additional hidden layers
and/or outputs. Here the equivalent width takes the role of the
redshift.

where the sum is over all nodes $i$ sending connections to node $j$.
The activation functions are typically taken (in analogy to
biological neurons) to be sigmoid functions such as
$g_j(u_j) = 1/[1 + \exp(-u_j)]$, and we follow this approach here. An
extra input node - the bias node - is automatically included to allow
for additive constants in these functions.

For a particular input vector, the output vector of the network 
is determined by progressing sequentially through the network lay- 
ers, from inputs to outputs, calculating the activation of each node 
(hence this type of neural network is often referred to as a feed- 
forward network). 
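
The feed-forward evaluation described above can be sketched as follows.
This is an illustration rather than the ANNz implementation: the text
does not specify the activation of the output node, so a linear output
is assumed here, the bias node is assumed to feed every non-input layer
with a fixed value of 1, and the name `forward` and the example weights
are hypothetical.

```python
import math

def sigmoid(u):
    """Activation g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(inputs, layers):
    """Propagate one input vector through the network, layer by layer.

    `layers` is a list of weight matrices. Row j of a matrix holds the
    weights of the connections into node j; the last entry of each row
    multiplies the bias node, whose value is fixed at 1.
    """
    activations = list(inputs)
    for depth, weights in enumerate(layers):
        extended = activations + [1.0]  # append the bias node
        sums = [sum(w * a for w, a in zip(row, extended)) for row in weights]
        is_output = depth == len(layers) - 1
        # Assumption of this sketch: sigmoid hidden nodes, linear output node.
        activations = sums if is_output else [sigmoid(u) for u in sums]
    return activations

# A hypothetical 2:2:1 network with hand-picked weights.
hidden = [[1.0, -1.0, 0.0], [0.0, 0.0, 0.0]]
output = [[2.0, 4.0, 1.0]]
result = forward([0.5, 0.5], [hidden, output])
```

Both hidden nodes receive an argument of zero and so return 0.5; the
output node then gives 2(0.5) + 4(0.5) + 1 = 4.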

Given a suitable training set of galaxies for which we have
both photometry, $\mathbf{m}$, and here the equivalent width,
$W_{\rm train}$, the ANN is trained by minimising the cost function

$E = \sum_k \left[ W_{\rm out}(\mathbf{w}, \mathbf{m}_k) - W_{{\rm train},k} \right]^2$, (B2)

with respect to the weights, $\mathbf{w}$, where
$W_{\rm out}(\mathbf{w}, \mathbf{m}_k)$ is the network output for the
given input and weight vectors, and the sum is over the galaxies in
the training set. To ensure that the weights are regularised (i.e.
that they do not become too large), an extra quadratic cost term

$E_w = \beta \sum_{i,j} w_{ij}^2$, (B3)

is added to equation (B2).
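
As a small worked illustration of the total cost, the sum of equations
(B2) and (B3) could be evaluated as below. The function name
`regularised_cost`, the flat list of weights and the numerical values
are assumptions made for this sketch only.

```python
def regularised_cost(predicted, target, weights, beta):
    """Sum-of-squares cost (B2) plus the quadratic weight penalty (B3)."""
    data_term = sum((p - t) ** 2 for p, t in zip(predicted, target))
    penalty = beta * sum(w ** 2 for w in weights)
    return data_term + penalty

# Hypothetical network outputs, training targets and weights.
cost = regularised_cost(
    predicted=[1.0, 2.0],
    target=[0.0, 4.0],
    weights=[1.0, -2.0],
    beta=0.5,
)
```

Here the data term is 1 + 4 = 5 and the penalty is 0.5 (1 + 4) = 2.5,
giving a total cost of 7.5. A larger value of the coefficient in (B3)
penalises large weights more strongly relative to the misfit.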

We use an iterative quasi-Newton method to perform this
minimisation. Details of the minimisation algorithm and
regularisation may be found in Bishop (1995) and Lahav et al. (1996,
Appendices).

After each training iteration, the cost function is also evalu- 
ated on a separate validation set. After a chosen number of train- 
ing iterations, training terminates and the final weights chosen for 
the ANN are those from the iteration at which the cost function is 
minimal on the validation set. This is useful to avoid over-fitting 
to the training set if the training set is small. The trained network 
may then be presented with previously unseen input vectors, and 
the outputs computed. 
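
The validation-based choice of final weights can be sketched
independently of the optimiser. In the illustration below, `step`
stands for one training iteration of whatever minimiser is used (a plain
gradient-descent update in the toy example, swapped in for the
quasi-Newton method for brevity), and `validation_cost` evaluates the
cost on the validation set; both names, and the one-parameter toy model,
are hypothetical.

```python
def train_with_validation(weights, step, validation_cost, n_iterations):
    """Run a fixed number of iterations and keep the weights from the
    iteration with the lowest cost on the validation set."""
    best_weights, best_cost = list(weights), float("inf")
    for _ in range(n_iterations):
        weights = step(weights)
        cost = validation_cost(weights)
        if cost < best_cost:
            best_weights, best_cost = list(weights), cost
    return best_weights, best_cost

# Toy one-parameter model: the training cost (w - 2)^2 is minimised by
# gradient descent with learning rate 0.25, while the validation cost
# (w - 1.5)^2 has its minimum at a different value.
def step(weights):
    return [w - 0.25 * 2.0 * (w - 2.0) for w in weights]

def validation_cost(weights):
    return (weights[0] - 1.5) ** 2

best_weights, best_cost = train_with_validation([0.0], step, validation_cost, 10)
```

In this toy case the iterates are 1.0, 1.5, 1.75 and so on towards 2, so
the weights kept are those from the second iteration, even though later
iterations continue to lower the training cost.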

To apply the ANNz code¹ to our problem, all that is needed is to
regard the output node as the equivalent width instead of the
redshift. In this work we found it optimal to work with log(EW).



¹ http://zuserver2.star.ucl.ac.uk/~lahav/annz.html



